Extract Topics from Documents (LDA) (Operator Toolbox)

Synopsis

This operator extracts topics from a collection of documents using Latent Dirichlet Allocation (LDA).

Description

LDA (Latent Dirichlet Allocation) is a method for identifying topics in documents. This implementation of LDA uses the ParallelTopicModel of the Mallet library (source: Newman, Asuncion, Smyth and Welling, Distributed Algorithms for Topic Models, JMLR (2009)) with the SparseLDA sampling scheme and data structure (source: Yao, Mimno and McCallum, Efficient Methods for Topic Model Inference on Streaming Document Collections, KDD (2009)).

LDA provides topic diagnostics in the model object. For details on the measures see: http://mallet.cs.umass.edu/diagnostics.php . Note that some measures depend on the number of top words.

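When the model is applied to documents, LDA uses Gibbs sampling to infer their topics. The sampling is controlled by additional parameters (LDA.iterations, LDA.burnin and LDA.thinning) that are available in Apply Model (Documents).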
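The interplay of iterations, burn-in and thinning during model application can be illustrated with a small Python sketch. It is not the operator's or Mallet's code. It assumes that sampling rounds are counted from 1, that a round contributes when it lies after the burn-in and its index is a multiple of the thinning value, and that the confidence is the plain average of the kept rounds. The function names kept_iterations and average_confidence are hypothetical.

```python
from typing import List


def kept_iterations(iterations: int, burnin: int, thinning: int) -> List[int]:
    """Return the 1-based sampling rounds that contribute to the confidence.

    Assumption: a round is kept if it lies after the burn-in phase and
    its index is a multiple of the thinning value.
    """
    if thinning < 1:
        raise ValueError("thinning must be at least 1")
    return [i for i in range(1, iterations + 1)
            if i > burnin and i % thinning == 0]


def average_confidence(per_round: List[List[float]],
                       burnin: int, thinning: int) -> List[float]:
    """Average the per-round topic proportions of one document.

    per_round[i - 1] holds the (hypothetical) topic proportions observed
    in sampling round i.
    """
    kept = kept_iterations(len(per_round), burnin, thinning)
    if not kept:
        raise ValueError("no rounds left after burn-in and thinning")
    n_topics = len(per_round[0])
    return [sum(per_round[i - 1][t] for i in kept) / len(kept)
            for t in range(n_topics)]


if __name__ == "__main__":
    print(kept_iterations(100, 10, 10))   # rounds 20, 30, ..., 100
    print(kept_iterations(100, 100, 10))  # [] : nothing left to average
```

Under these assumptions, a burn-in equal to or larger than the number of iterations leaves no rounds to average, and a larger thinning value reduces the number of rounds that enter the confidence.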

Input

  • col (Collection)

    A preprocessed collection of documents.

Output

  • exa (Data table)

    An ExampleSet with added "documentId" and "TopicId" attributes, and an additional attribute showing the confidence that this document belongs to the topic.

  • top (Data table)

    An ExampleSet with details on the topics. For each topic, the operator returns the 5 most used words.

  • mod

    The topic model. It can be applied to a new collection of documents using Apply Model (Documents).

  • per (Averagable)

    The LogLikelihood value of the fit which can be used for optimization.

Parameters

  • number of topics Number of topics to search.
  • use alpha heuristics If this parameter is set to true, alpha is set automatically. The heuristic used is: 50 / number of topics.
  • alpha sum Bayesian prior on the topic distribution.
  • use beta heuristics If this parameter is set to true, beta is set automatically. The heuristic used is: 50 / number of words.
  • beta Bayesian prior on the word distribution.
  • optimize hyperparameters If this parameter is set to true, both alpha and beta are optimized every k-th step. k is set by the "optimize interval for hyperparameters" parameter.
  • optimize interval for hyperparameters Frequency of hyperparameter optimization.
  • top words per topic Number of words to pull to describe one topic.
  • iterations Number of iterations for optimization.
  • reproducible If this parameter is set to true, parallel execution will be deactivated. Results may differ between runs if this is left unchecked.
  • enable logging If this parameter is set to true, additional output is provided in the Log panel.
  • use local random seed This parameter indicates if a local random seed should be used.
  • local random seed If the use local random seed parameter is checked, this parameter determines the local random seed.
  • include meta data If checked, available meta information of the text, such as filename and date, is added as attributes.
  • LDA.iterations (Application) Number of iterations for Gibbs sampling. Available in Apply Model (Documents).
  • LDA.burnin (Application) Ignore the first x rounds of sampling. Should be smaller than LDA.iterations, otherwise no sampling rounds remain. Available in Apply Model (Documents).
  • LDA.thinning (Application) Only use every x-th iteration to determine the confidence. Available in Apply Model (Documents).
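
The two heuristics and the optimization interval described above amount to simple arithmetic, shown here as a short Python sketch. The function names are hypothetical. The sketch assumes that "number of words" means the number of distinct words in the collection and that hyperparameter optimization runs on every training iteration whose index is a multiple of the interval; the operator may count steps differently.

```python
from typing import List


def alpha_heuristic(number_of_topics: int) -> float:
    """Alpha heuristic as stated in the parameter description: 50 / topics."""
    if number_of_topics < 1:
        raise ValueError("number_of_topics must be at least 1")
    return 50.0 / number_of_topics


def beta_heuristic(number_of_words: int) -> float:
    """Beta heuristic as stated in the parameter description: 50 / words.

    Assumption: number_of_words is the size of the vocabulary.
    """
    if number_of_words < 1:
        raise ValueError("number_of_words must be at least 1")
    return 50.0 / number_of_words


def optimization_steps(iterations: int, interval: int) -> List[int]:
    """Training iterations (1-based) at which hyperparameters are optimized.

    Assumption: optimization happens on every interval-th iteration.
    """
    if interval < 1:
        raise ValueError("interval must be at least 1")
    return [i for i in range(1, iterations + 1) if i % interval == 0]


if __name__ == "__main__":
    print(alpha_heuristic(10))         # 5.0
    print(beta_heuristic(2000))        # 0.025
    print(optimization_steps(50, 10))  # [10, 20, 30, 40, 50]
```

With 10 topics and a vocabulary of 2000 words, for example, the heuristics give alpha = 5.0 and beta = 0.025.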

Tutorial Processes

A simple application on lorem ipsum

This sample process is a minimal example for LDA. It generates a collection of documents based on Lorem Ipsum, processes them using the Text Processing extension, and feeds them into the LDA operator.